DOCUMENT RESUME

ED 155 215                    95                    TM 007 267

AUTHOR        Porter, Andrew C.; And Others
TITLE         Impact on What?: The Importance of Content Covered.
              Research Series No. 2.
INSTITUTION   Michigan State Univ., East Lansing. Inst. for
              Research on Teaching.
SPONS AGENCY  National Inst. of Education (DHEW), Washington, D.C.
              Basic Skills Group. Teaching Div.
PUB DATE      Feb 78
CONTRACT      400-76-0073
NOTE          37p.
EDRS PRICE    MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS   *Achievement Tests; Arithmetic; *Content Analysis;
              *Course Content; *Course Evaluation; Elementary
              School Mathematics; Evaluation Criteria; Evaluation
              Methods; Grade 4; Intermediate Grades; *Item
              Analysis; Program Evaluation; Standardized Tests;
              Tests of Significance; *Test Validity
IDENTIFIERS   *Content Validity



ABSTRACT 

Defining practical significance in program evaluations is a
difficult measurement problem which can only be solved by an
intimate familiarity with the measures on which effects are
estimated and their content relationship to the program goals.
Past attempts to provide general solutions to the size of effect
problem have relied on standardized indices which can be estimated
and reported without any knowledge of what was measured. Such
efforts are viewed here as steps in the wrong direction. Instead,
what is called for is a procedure whereby the content goals of the
program, the content implied by a test, and the interrelationship
between the two are made explicit. The procedure should investigate
treatment-by-item interactions and, at the same time, describe the
measures used so that persons other than the evaluator can reach
their own decisions about practical significance. Analysis of the
mathematics sections of four major intermediate level standardized
tests (Iowa Tests of Basic Skills, Metropolitan Achievement Tests,
Stanford Achievement Tests, and California Test of Basic Skills)
with their taxonomies indicated rather substantial differences in
content tested. It was clear that standardized tests are not well
suited to the task of estimating item domain by treatment
interactions.
(Author/CTM) 
Abstract

Efforts to define the impact of programs have resulted in an
important distinction between the statistical question of
reliability of effects and the measurement question of size of
effects. Presented here is a discussion of the size of effect
question. The size of any program effect, however, cannot be
interpreted without first knowing how the general effect has been
constructed from its components. In this paper, the authors review
points made in the literature on the size of effect question and
then focus on the more fundamental question of the construction of
an aggregate program effect.

In a mathematics program, for example, the same aggregate effect
might be produced by a large gain in computational skills or by a
large gain in understanding of mathematical concepts. Individuals
or school districts, however, may place a higher value on one of
these two areas. Similarly, an effect in an area to which
considerable program resources were devoted would have different
meaning than an effect in an area to which no resources were
devoted.

The selection or construction of measures of program effects
(standardized tests, for example) is thus a crucial issue in
evaluation. Clearly, a small aggregate effect on a test in which
all parts are consistent with the program goals has different
meaning from a small aggregate effect on a test which has only 50%
overlap with the program goals.

Standardized norm referenced tests are typically designed to
maximize individual differences and are not necessarily well suited
to estimate program impacts. Rather, tests should be chosen or
constructed on the basis of the content or goals of the program to
be evaluated.

In a current IRT study of the content of fourth grade mathematics,
a method of describing content was developed through an iterative
process of analysis and classification of items on standardized
tests, beginning with the mathematics sections of the most widely
used standardized tests: the Stanford Achievement Test (SAT), the
Iowa Test of Basic Skills (Iowa), the Metropolitan Achievement Test
(MAT), and the California Test of Basic Skills (CTBS).

Substantial differences were found among the standardized tests.
On the Iowa, 40% of the items were story problems, compared to 22%
for the CTBS, for example. Clearly, the standardized tests selected
can interact with the content of instruction in ways that could
produce dramatically different aggregate estimates of program
impact.

Such analyses of tests and instructional materials lead to new
approaches in program evaluation. Test selection and construction
can be improved by attention to the content areas emphasized.
Analysis of materials can be used to provide a better match between
instruction and evaluation. The analysis might also be extended to
the content presented in programs, which might prove useful for
comparisons of programs or for studies of program implementation.

Once the content areas covered by a measure and the procedure used
to aggregate effects in these areas are understood, the problem of
size of effect must still be addressed. But no sensible solution
can be offered until the aggregation in the outcome measure is
better understood.

IMPACT ON WHAT?: THE IMPORTANCE OF CONTENT COVERED

by Andrew C. Porter,
William H. Schmidt, Robert E. Floden,
and Donald J. Freeman



Introduction 



When defining the impact of programs it is important to distinguish
between the reliability of effects (a statistical question) and the
size of effects (a measurement question). The statistical question
has already been adequately defined, but the measurement question
has not. Traditionally, the measurement question has been stated:
"What size must a program effect attain to be practically
significant?" Practical significance, in turn, has been
deliberately or inadvertently equated with various indices such as
statistical significance, strength of an association, or standard
deviation units. These efforts to define practical significance,
however, disregard the fact that any program effect is estimated
with an aggregate measure which cannot be interpreted without first
considering the components aggregated. The primary focus of this
paper, therefore, is the fundamental question of the composition of
an aggregate program effect. The analysis begins with a critical
review of efforts to define practical significance. It then focuses
on empirical unidimensionality and item-by-treatment interactions,
two concepts which are central to the interpretation of aggregate
program effects. The paper concludes with an illustrative content
analysis of standardized mathematics tests and a brief discussion
of how such content analyses are significant for the size of
effects problem.

Past Efforts to Define Practical Significance

Four general problems with past attempts to assess the size of
effect can be identified. First, many researchers confuse practical
significance with statistical significance; neither type of
significance implies the other. This confusion is perhaps the most
frequent misinterpretation of the size of program effects. In the
behavioral sciences this confusion may well stem from an historical
preoccupation with testing the null hypothesis. Since the results
of tests for statistical significance are on a dichotomous scale
(significant or not), there is little information immediately
available to provide further guidance. Some investigators have
attempted to squeeze extra meaning from significance tests by
reporting results from "almost significant" to "highly
significant." It is well known, however, that any nontrivial null
hypothesis can be rejected given sufficient precision of analysis.
For example, an F test statistic for differences of means has
sample size in its numerator and so can be manipulated quite
independently of the effect being investigated. In that sense, the
null hypothesis can be thought of as a "straw man" (Kempthorne &
Folks, 1971, p. 347) and any failure to reject it as a Type II
error. Evidence of statistical significance, therefore, is not
sufficient to support policy.
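
The dependence of a significance test on sample size can be made
concrete with a small sketch. It is not taken from the paper: it
swaps the F test for a simpler two-group z test with known unit
variance, and the function name and the 0.05 standard deviation
effect are illustrative assumptions. The effect is held fixed while
only the group size changes.

```python
import math


def two_sided_p(effect_sd_units: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-group z test when the observed mean
    difference equals `effect_sd_units` standard deviations and each
    group has `n_per_group` members (assumed, simplified setup)."""
    # The z statistic grows with the square root of the group size.
    z = effect_sd_units * math.sqrt(n_per_group / 2.0)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Same trivial effect (0.05 SD), increasingly large samples.
    for n in (100, 1_000, 10_000, 100_000):
        print(n, two_sided_p(0.05, n))
```

Under these assumptions the p-value falls steadily as the groups
grow, even though the effect itself never changes, which is the
sense in which statistical significance says little about size.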

Morrison and Henkel (1970) have provided an interesting collection
of articles dealing with the significance test controversy, several
of which comment directly on the important distinction between
practical and statistical significance. While none of the articles
provided an answer for defining practical significance, it was
pointed out that value judgments are at issue in defining
statistical significance as well as in defining practical
significance. In the case of statistical significance, however, the
value question is resolved through convention when the investigator
agrees on one of two levels of significance.

In that same collection, Gold (1969) indicated a second difficulty
in defining the importance of an effect: the importance of an
effect of a given size may vary with its location on a scale.
Different utilities may be assigned to a fixed increment at
different points on a scale. Even for interval scale data, a one
unit effect may have different meaning depending upon its location
along the scale continuum. The possibility of shifting importance
was recognized long ago in another context when Dalton (1920)
stated that increments in income have progressively less utility
after the base income reaches a certain level.

A third problem with past attempts to define practical significance
is that many size of effect indices are influenced by factors
independent of the utility of an effect. Thus, single effects may
produce widely varying values depending on factors such as
population heterogeneity and amount of measurement error. These
difficulties plague even such "scale free" estimates as measures of
association and measures in standard deviation units.
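
A brief sketch may help show how one raw effect yields different
standardized values. It assumes classical test theory, in which
observed-score variance equals true-score variance divided by
reliability; the function name and the numbers are hypothetical and
do not come from the paper.

```python
import math


def standardized_effect(raw_gain: float,
                        true_score_sd: float,
                        reliability: float = 1.0) -> float:
    """Raw gain expressed in observed-score standard deviation units.

    Assumes classical test theory: observed variance equals
    true-score variance divided by reliability, so measurement error
    inflates the denominator and shrinks the index.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    observed_sd = true_score_sd / math.sqrt(reliability)
    return raw_gain / observed_sd


if __name__ == "__main__":
    # The same 5-point gain under three measurement conditions.
    print(standardized_effect(5, 10))         # error-free test
    print(standardized_effect(5, 10, 0.64))   # less reliable test
    print(standardized_effect(5, 20))         # more heterogeneous group
```

In this sketch the 5-point gain is identical in all three calls;
only reliability or population spread differs, yet the index
changes each time.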

Many suggest that reporting an index of association which is
relatively insensitive to sample size will avoid the problems
suggested by equating statistical and practical significance. Eta
squared, epsilon squared, omega squared, and the Pearson
correlation have all been used in an attempt to indicate the
practical importance of observed relationships. A substantial body
of literature has evolved surrounding the relative advantages and
disadvantages of these indices (e.g., Cohen, 1969; Friedman, 1968;
Hays, 1963; Kennedy, 1970). In practice, the small differences
among indices are probably of little importance given the
imprecision of the data from which they are calculated.

Furthermore, the sampling fluctuation of the index is often
ignored. Most advocates of measures of association first ask if the
relationship is significantly different from zero. Once statistical
significance has been observed, however, it is common practice to
forget about sampling fluctuations and interpret the point estimate
of association as a parameter. Thus, if the criterion for
importance is 10% or more of the variance accounted for, a sample
R squared of .10 or larger is taken to meet the criterion.
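
The sampling fluctuation at issue can be illustrated with an
approximate interval. The sketch below is an assumption-laden
illustration, not the authors' procedure: it treats the index as
the square of a simple positive Pearson correlation and uses the
Fisher z transformation; the function name and the sample size of
50 are hypothetical.

```python
import math


def r_squared_interval(r_squared: float, n: int,
                       z_crit: float = 1.96) -> tuple:
    """Approximate 95% interval for a squared correlation.

    Assumes a simple bivariate correlation with positive sign and
    uses Fisher's z, whose standard error is 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    r = math.sqrt(r_squared)
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    low_r = math.tanh(z - half_width)
    high_r = math.tanh(z + half_width)
    # If the interval for r reaches below zero, r squared can be 0.
    low_sq = 0.0 if low_r < 0 else low_r ** 2
    return low_sq, high_r ** 2


if __name__ == "__main__":
    # A sample value exactly at the 10% criterion, with n = 50.
    print(r_squared_interval(0.10, 50))
```

With these assumed inputs the interval is wide and extends well
below the .10 criterion, so a sample value that just meets the
criterion gives weak evidence that the parameter does.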

Glass and Hakstian (1969) have been critical of all indices of
association, at least for use in designs involving fixed effects.
They express concern that researchers will be misled into
interpreting a fixed effect as though it were random. To their
concern it should be added that the measures of association are all
functions which depend upon the heterogeneity of the population
selected for investigation and the amount of measurement error in
the variables. Neither heterogeneity nor measurement error is
likely to covary with practical significance, however defined.
Clearly, then, defining practical significance in terms of an index
of association is potentially misleading.

Even if measures of association are useful for deciding what
constitutes a strong relationship, this question remains: "How
large must the index be to be practically significant?" It is
difficult to decide how much variance explained is sufficient to
have practical value (particularly if the independent variable is
qualitative with more than two levels). In reference to estimates
of aptitude by treatment interactions, Cronbach and Snow (1977)
assert that a .40 difference between standardized regression
coefficients "seems likely to be theoretically important," as is "a
difference between transformed correlation coefficients of .424"
(p. 56). While they provided no substantive rationale for their
criterion, they did add the caveat that "costs and utilities could
warrant specifying a greater or smaller effect size" (p. 56).

Others who distinguish practical significance from statistical
significance have turned to expressing effects in standard
deviation units. Criteria such as .5 or more standard deviations
have been used when judging the importance of findings from
evaluations (e.g., Westinghouse Learning Corporation, 1969). In
addition, standard deviation units have nearly universal
application in defining the size of effect to be detected in a
priori power calculations (Brewer, 1972; Cohen, 1969; Subkoviak &
Levin, 1977). But which standard deviation should be used? Should
it be the square root of the error variance for testing the
significance of an effect and so perhaps differ from hypothesis to
hypothesis within a study? Should it be the standard deviation
defined on individuals even when the unit of analysis is some
aggregate of individuals? These and similar questions remain
unanswered.

Despite the difficulties, most people concerned with conducting
evaluations agree that an evaluator should decide for him/herself
(or his/her client) what constitutes practical significance and
design the evaluation and report the results accordingly (e.g.,
Boruch, 1977). Of the three procedures for defining practical
significance just reviewed, measures of association provide the
metric least sensitive to factors conceptually unrelated to the
size of a program effect; in that sense, they seem best suited to
the problem of defining practical significance. Regardless of
metric, however, what constitutes an important effect in an
evaluation depends on value judgments which may be made in
different ways by different parties.

The fourth problem with previous attempts at defining practical
significance is poor reporting practice that makes it difficult, if
not impossible, to reasonably assess the utility of a program
effect without access to the original data. The lack of reported
information about the composition of an outcome measure forces the
reader to accept the evaluator's values. Even worse, it may be that
the evaluator is naive about the validity of his measures. In
reporting results, therefore, it seems reasonable to strive to
present sufficient information about the variables so that others
might exercise their own values when interpreting the findings.

Unfortunately, all of the methods for defining practical
significance considered thus far facilitate the practice of
reporting program effects without providing information about the
composition of the dependent variable. The extent of this reporting
problem is indicated in Anderson's review of 130 articles (in the
Journal of Educational Psychology and the American Educational
Research Journal) from June 1964 to February 1971 in which one or
more homemade tests of reading comprehension were used:

Most investigators reported nothing about their tests beyond
such rudimentary information as the number of items and the
response mode. Several investigators did not hint that a test
was used until the analysis of variance was described, at which
point the test was mentioned no more. One investigator
characterized his test in a single sentence. "Criterion
achievement was measured by the final achievement test."
(1972, p. 165)

Dimensionality of Achievement Tests

Aside from the four problems discussed above, the common practice
of thinking about the size of an effect in terms of an aggregate
measure is, in itself, likely to be misleading. An achievement test
generally assesses achievement in a number of content areas. Thus,
identical aggregate scores on an achievement test do not
necessarily reflect the same level of achievement across all
content areas. In a mathematics program, for example, the same
aggregate effect might be produced by a large gain in either
computational skills or understanding of mathematical concepts. The
values placed on each of these two areas may differ, however.
Similarly, an effect in an area to which considerable program
resources had been devoted would have different meaning than an
effect in an area to which no resources had been committed.
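
The point about identical aggregates can be shown with a toy
calculation. Everything in the sketch is hypothetical (the two
programs, the item counts, and the gains in proportion correct);
it only illustrates that an item-weighted aggregate hides where
the gain occurred.

```python
def aggregate_gain(gains_by_area: dict, items_by_area: dict) -> float:
    """Item-weighted mean gain (in proportion correct) across areas."""
    total_items = sum(items_by_area.values())
    weighted = sum(gains_by_area[area] * items_by_area[area]
                   for area in items_by_area)
    return weighted / total_items


# Hypothetical test with equal numbers of items in two areas.
ITEMS = {"computation": 20, "concepts": 20}

# Two hypothetical programs with opposite patterns of gain.
PROGRAM_A = {"computation": 0.20, "concepts": 0.00}
PROGRAM_B = {"computation": 0.00, "concepts": 0.20}

if __name__ == "__main__":
    print(aggregate_gain(PROGRAM_A, ITEMS))
    print(aggregate_gain(PROGRAM_B, ITEMS))
```

Both hypothetical programs produce the same aggregate, so a reader
given only the aggregate could not apply his or her own values to
the two content areas.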

Yet, achievement tests (or at least subtests) are constructed to be
empirically unidimensional. For example, the mathematics subtests
of concepts, computation, and applications on the Stanford
Achievement Test (SAT) Intermediate Level I, Form A are reported to
have internal consistency reliabilities of .87, .91, and .93,
respectively, when given to beginning fifth graders. Evidence of
internal consistency has been taken as evidence that all items
measure a single trait; this brings into question the utility of
identifying subsets of items (e.g., Goolsby, 1966). There are at
least two reasons why evidence of a test's empirical
unidimensionality may be misleading as to the utility of
identifying subsets of items. The first reason stems from the
definition of empirical unidimensionality; the second is a function
of the ways in which unidimensionality is estimated.

The empirical definition of unidimensionality calls for a large
first factor on the item intercorrelation matrix. Thus, empirical
unidimensionality is a static concept specific to the time of test
administration and the population of respondents. Consider a
population of respondents and a set of items that yield an item
intercorrelation matrix with equal off-diagonal elements. Suppose
half the items require division with remainder, half the items
require multiplication of three-digit numbers, and the population
of respondents is beginning fourth grade students. If experiencing
an intervention were to uniformly reduce the difficulty of half the
items (for example, if the intervention focused on multiplication
of three-digit numbers and did not consider division), the only
effect on the item intercorrelation matrix would be to create a
difficulty factor. The difficulty factor could be avoided by use of
tetrachoric coefficients (Carroll, 1961). Yet, despite empirical
unidimensionality (both prior to and after the intervention), there
is clearly a useful distinction between the two subsets of items.
It is of interest, therefore, to ask whether a test is
unidimensional relative to an intervention, i.e., does an
intervention affect all item difficulties equally? Searching for
differential effects across items is analogous to searching for
aptitude-by-treatment interactions (ATI's) and might be called the
search for item-by-treatment interactions (ITI's).
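
A minimal sketch of what such a search might compute follows. It is
an illustration under stated assumptions, not the authors' method:
item difficulty is represented as proportion correct, the
interaction is taken as each item's treatment effect minus the mean
effect across items, and the item labels and proportions are
invented.

```python
def item_by_treatment_interactions(p_treated: dict,
                                   p_control: dict) -> dict:
    """Per-item treatment effect minus the mean effect across items.

    Inputs map item labels to proportion correct in each group. If
    every returned value is zero, the treatment shifted all items
    equally on this (assumed) proportion-correct scale.
    """
    effects = {item: p_treated[item] - p_control[item]
               for item in p_control}
    mean_effect = sum(effects.values()) / len(effects)
    return {item: effects[item] - mean_effect for item in effects}


# Hypothetical data: the program taught multiplication only.
CONTROL = {"mult_1": 0.30, "mult_2": 0.35, "div_1": 0.30, "div_2": 0.25}
TREATED = {"mult_1": 0.60, "mult_2": 0.65, "div_1": 0.30, "div_2": 0.25}

if __name__ == "__main__":
    for item, value in item_by_treatment_interactions(TREATED,
                                                      CONTROL).items():
        print(item, round(value, 3))
```

In this invented case the multiplication items sit above the mean
effect and the division items below it, a pattern the aggregate
score alone would not reveal.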

Most test data, however, are not confined to individuals receiving
a single intervention. In education, different students receive
different educational experiences, and these experiences may have
different effects across items. If a test is comprised of sets of
items defined by concepts such that the effect of an intervention
is constant within each set, and if the effects of interventions
vary with less-than-perfect correlation across sets of items, the
sets should be reflected in the pattern of item intercorrelations.
This effect on item intercorrelations occurs because the
intervention effects contribute to both the covariance and variance
of items within a set but not to the covariance of items between
sets. Since data from norm groups of standardized tests would seem
to be a case in point, the fact that they are reported to be
internally consistent still seems to challenge the importance of
ITI's. The apparent unidimensionality of standardized tests,
however, may only be evidence for the existence of a strong single
dimension, not for the absence of content factors. If, in the
situation just described, items were arranged by concepts, the item
intercorrelation matrix would be a super matrix with submatrices on
the main diagonal representing within-concept correlations. If
ITI's are present, the diagonal submatrices will have higher
correlations than the off-diagonal submatrices, thus yielding a
factor for each concept. (The off-diagonal submatrices could all be
equal except for the effects of varying item difficulties.) The
off-diagonal submatrices will also tend to have positive
correlations, however, because of individual differences in
aptitude and the likelihood of positive correlations between
intervention effects across sets of items, due to the hierarchical
nature of most subject matter. The positive off-diagonal
submatrices contribute to a single common factor. By the
Spearman-Brown prophecy formula, the more concepts included, the
stronger the general factor. Furthermore, the fewer items per
concept, the less clearly defined the second order concept factors.
Thus, evidence of an internally consistent test should not be
misconstrued as indicating the uselessness of searching for ITI's
in evaluations using that test.
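
The argument can be checked numerically with a hedged sketch. It
uses standardized coefficient alpha (the Spearman-Brown form
applied to the mean inter-item correlation) as a stand-in for
"internal consistency"; the block structure, the correlations of
.4 within concepts and .2 between concepts, and the function name
are all assumptions made for illustration.

```python
def standardized_alpha(n_concepts: int, items_per_concept: int,
                       r_within: float, r_between: float) -> float:
    """Standardized alpha for a block-structured correlation matrix.

    Items in the same concept correlate r_within; items in different
    concepts correlate r_between (assumed, idealized super matrix).
    """
    k = n_concepts * items_per_concept
    within_pairs = n_concepts * items_per_concept * (items_per_concept - 1) // 2
    total_pairs = k * (k - 1) // 2
    between_pairs = total_pairs - within_pairs
    mean_r = (within_pairs * r_within
              + between_pairs * r_between) / total_pairs
    # Spearman-Brown form: k * mean_r / (1 + (k - 1) * mean_r).
    return k * mean_r / (1 + (k - 1) * mean_r)


if __name__ == "__main__":
    # Four distinct concept factors, ten items each.
    print(standardized_alpha(4, 10, 0.4, 0.2))
    # Eight concepts: more concepts, still higher consistency.
    print(standardized_alpha(8, 10, 0.4, 0.2))
```

Under these assumptions the coefficient exceeds .9 even though the
matrix was built with separate concept factors, and it rises
further as concepts are added, in line with the reasoning above.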

When defining practical significance, then, concern for describing
test content validity for an intervention and the possibility of
ITI's are both important. Those who have been interested in the
possibility of item-by-treatment interactions have, for the most
part, been relatively unconcerned about constructing achievement
tests to reflect the content of interventions (Mandeville, 1972;
Moonan, 1955). Recently, the most visible interest in ITI's has
been in the area of detecting bias in existing tests (e.g., Cleary,
1968; Jensen, 1976). In this context, few interactions have been
found, though Gupta (1969) reported a sex by item interaction for
Step-Math 2A. Likewise, those who have called for careful test
construction to reflect the content of interventions have not
seemed particularly concerned with detecting item-by-treatment
interactions (with the exception, maybe, of Hastings, 1966).

As usual, actual practice has lagged well behind recommended
practice. The call for program-valid achievement testing in
evaluation (e.g., Bloom, Hastings, & Madaus, 1971; Nunnally &
Wilson, 1975; Shoemaker, 1975) remains largely ignored. The goals
of educational interventions are typically vague, making difficult
the selection of content-valid dependent variables. Even when
program implementation is given explicit attention, the content
goals of the program are inadequately considered. In the discussion
of curriculum change by Fullan and Pomfret (1977, p. 361), for
example, there was little analysis of content goals (just one of
five dimensions considered).

In a few notable exceptions, evaluations have included carefully
constructed program-valid achievement measures. Hively, Maxwell,
Rabehl, Sension, and Lundin (1973) provided a detailed account of
their domain-referenced evaluation of the MINNEMAST Project, a
modern mathematics and science curriculum for elementary school. In
that evaluation, the authors make extensive use of item forms to
represent domains. The chapters in Part II of Bloom et al. (1971)
also provide illustrations of content analyses on which
program-valid achievement tests might be constructed. Finally,
objectives-referenced test systems which are commercially available
make it possible to construct tailor-made achievement tests for a
few subject matter areas (e.g., SOBAR Field Manual I, 1972). In
more general terms, the concept of "universe-defined" (Osburn,
1968; Hively, Patterson, & Page, 1968) or domain-referenced (Hively
et al., 1973) tests, where content is made clear through rules of
construction, holds promise for meeting both ITI and reporting
concerns.



An Illustration Using Standardized Test Content

The mainstream of educational evaluation continues to rely on
standardized tests of achievement as opposed to tests constructed
to fit a particular need. These standardized tests are designed to
evaluate individual student differences on a rather amorphous
national curriculum. The market is dominated by the Stanford
Achievement Tests, Iowa Tests of Basic Skills, California Test of
Basic Skills, and the Metropolitan Achievement Tests. The methods
for selecting one test over another in any particular situation are
not documented and remain unclear. For the most part, it appears
that these tests are used interchangeably, with frequent references
made to the high intercorrelation among corresponding subtests. As
noted previously, however, evidence of intercorrelation (internal
consistency) can be misleading in terms of interchanging tests.

Despite frequent attacks on the use of standardized tests for
program evaluation (Airasian & Madaus, 1976; Cox & Sterrett, 1970;
Shoemaker, 1975), the criticism remains on an abstract level. Far
too few careful analyses of standardized test content have been
completed, analyses which could demonstrate the link between test
and program to be evaluated and on which searches for ITI's could
be based. Jenkins and Pany (1976) analyzed five standardized
reading achievement tests of word recognition at grades one and two
and seven commercial reading series. After observing differences
among both tests and curricula and an interaction between the two,
they concluded, "It appears doubtful that conventional achievement
tests can serve as unbiased estimates of a curriculum's effect, at
least at early grade levels" (p. 12). While some questions can be
raised about the construction of their word lists, the possibility
of item-by-treatment interactions and misleading aggregate effects
is supported. An analysis of standardized tests and curricula for
reading comprehension by Armbruster, Stevens, and Rosenshine (1977)
also yielded differences between skills taught and skills tested.
The categories used for this analysis, however, seemed to be more a
function of the way in which test questions were asked than the
content of the text to be read. These categories did not isolate
vocabulary, sentence construction, sentence length, or complexity
of concepts, all of which are known to affect comprehension.

Developing a Taxonomy to Measure Content 

As part of our work on teacher decisions about the content of
instruction, it was necessary to develop a method for describing
the variety of content taught in fourth grade mathematics. On the
assumption that the items in standardized achievement tests of
mathematics at the fourth grade level should reflect that variety,
an iterative process of analysis and classification of items on the
Stanford Achievement Test was conducted. The result of that content
analysis and classification was the development of a taxonomy
(shown in Figure 1 at the end of this paper). The taxonomy is a
vehicle for illustrating test content analyses and their usefulness
for defining practical significance and investigating ITI's.

The taxonomy provides an explicit description of the match between
the content of a program and the content of a test used to evaluate
that program. If an intervention addresses the content implied by a
subset of the taxonomy cells, then those cells identify item
domains that should be included on a test of effects. If there are
hypotheses about transfer or concern for unanticipated negative
effects, item domains identified by other cells in the taxonomy
might also be included. Again, the taxonomy would help to make such
interests explicit and so increase the precision with which they
are addressed in the evaluation.
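
One way to make the match explicit is to record each test item's
taxonomy cell and compare it with the cells a program addresses.
The sketch below is only an illustration: the tuple representation
of a cell, the function name, and the example items are assumptions
and are not drawn from Figure 1.

```python
def content_overlap(test_item_cells: list, program_cells: set) -> float:
    """Share of test items whose taxonomy cell the program addresses."""
    if not test_item_cells:
        raise ValueError("test has no classified items")
    covered = sum(1 for cell in test_item_cells if cell in program_cells)
    return covered / len(test_item_cells)


# Hypothetical cells: (mode of presentation, material, operation).
PROGRAM = {
    ("operation specified", "multiple digits", "multiply"),
    ("story problem", "multiple digits", "multiply"),
}

# Hypothetical four-item test, one cell per item.
TEST_ITEMS = [
    ("operation specified", "multiple digits", "multiply"),
    ("story problem", "multiple digits", "multiply"),
    ("operation specified", "multiple digits", "divide with remainder"),
    ("graph, figure, or table", "geometric figures", "identify terms"),
]

if __name__ == "__main__":
    print(content_overlap(TEST_ITEMS, PROGRAM))
```

A figure of this kind, reported alongside an aggregate effect,
would let a reader see how much of the test the program could
reasonably have been expected to influence.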

Reporting the distribution of items across cells in the taxonomy
should also be an effective and efficient way to provide
information necessary to support value judgments about size of
effect. Further, the taxonomy should be useful in searching for
item-by-treatment interactions which, if present, make
interpretations of aggregate effects difficult. To facilitate the
estimation of such interactions, each item domain should be
represented by a set of items. The number of items in a set need
not be as large as suggested for reliability in individual
assessment, since item-by-treatment interactions are defined on
group means rather than individual scores. The standard error of a
group mean depends directly on group size (it shrinks as the group
grows), and group size counters the impact of low reliability due
to few items defining individual scores.

Our taxonomy is defined by the intersections of three factors:
(1) mode of presentation (3 levels), (2) nature of the material
(13 levels), and (3) operations (12 levels). The intersections of
these three factors result in 468 cells. In some respects the
taxonomy may appear to be unrealistically detailed, while in others
it may appear to gloss over important distinctions. Our goal was to
provide a level of detail sufficient for describing teacher
decisions about content of instruction. Clearly, the extent to
which there are similarities or differences in content between a
program and a test is a function of the detail level of the
description provided. It is important to assume that our taxonomy
is at a level of detail such that instruction can be directed to
some cells and not others. The taxonomy has been reviewed by
several teachers involved in mathematics instruction in the
elementary grades, and those reviews were generally supportive of
the assumption.
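
The cell count follows from crossing the three factors. As a small
sketch (levels are represented by integers only, an assumption made
for brevity), the cells can be enumerated directly:

```python
from itertools import product

# Numbers of levels stated in the text for each factor.
MODE_LEVELS = 3
MATERIAL_LEVELS = 13
OPERATION_LEVELS = 12


def taxonomy_cells() -> list:
    """All (mode, material, operation) triples, levels numbered from 1."""
    return list(product(range(1, MODE_LEVELS + 1),
                        range(1, MATERIAL_LEVELS + 1),
                        range(1, OPERATION_LEVELS + 1)))


if __name__ == "__main__":
    print(len(taxonomy_cells()))  # 3 * 13 * 12
```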

The first taxonomy factor, Mode of Presentation, distinguishes
between items which present essential information in graphs,
figures, or tables, and those which do not. For those items which
do not present essential information in graphs, figures, or tables,
a further distinction is made between items which specify the
operation required for solution and those which do not (e.g., the
typical story problem).

The second factor, Nature of the Material, has several levels
which are not mutually exclusive but which are ordered in complexity.
In using the taxonomy, an item is classified at the highest appropriate
level of complexity. In ascending order of complexity, the levels are:
(1) single digits, (2) single and multiple digits, (3) multiple digits,
(4) single fractions, (5) multiple fractions, (6) decimals, (7) percents,
(8) alternative number systems (e.g., Roman numerals, clock arithmetic),
(9) place value, (10) number sentences, (11) algebraic sentences
(unknown quantities not isolated by an equal sign), (12) conversion
from one scale of measurement to another, and (13) geometric figures.

The third factor, Operations, also includes levels which are
not mutually exclusive, and again items are classified at the highest
level of complexity appropriate. Starting with the least complex,
the levels are (1) add, (2) subtract without borrowing, (3) subtract
with borrowing, (4) add or subtract fractions without a common
denominator, (5) multiply, (6) divide without remainder, (7) divide with
remainder, (8) combination (more than one of the basic arithmetic
operations), (9) grouping (use of parentheses), (10) identify equivalents
(e.g., select the figure with one-fourth of its area shaded), (11) identify
rule (e.g., number series problems), (12) identify terms (essentially
vocabulary).
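
As a rough sketch of how the three factors combine, the taxonomy can be enumerated as the cross product of its levels. The level labels are shortened from the text, and the helper `highest_level` is a hypothetical name; it only illustrates the rule that an item is classified at the highest appropriate level of complexity.

```python
from itertools import product

MODES = [
    "graphs, figures, tables",
    "operation specified",
    "operation not specified",
]
MATERIALS = [
    "single digits", "single and multiple digits", "multiple digits",
    "single fractions", "multiple fractions", "decimals", "percents",
    "alternative number systems", "place value", "number sentences",
    "algebraic sentences", "conversion between scales of measurement",
    "geometric figures",
]
OPERATIONS = [
    "add", "subtract without borrowing", "subtract with borrowing",
    "add or subtract fractions without a common denominator",
    "multiply", "divide without remainder", "divide with remainder",
    "combination", "grouping", "identify equivalents",
    "identify rule", "identify terms",
]

# Every cell is one (mode, material, operation) combination.
cells = list(product(MODES, MATERIALS, OPERATIONS))
print(len(cells))  # 3 * 13 * 12 = 468


def highest_level(ordered_levels, applicable):
    """Pick the applicable level that comes last in the complexity ordering."""
    return max(applicable, key=ordered_levels.index)


# Hypothetical item that involves both addition and unlike denominators.
print(highest_level(
    OPERATIONS,
    {"add", "add or subtract fractions without a common denominator"},
))
```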

Classifying the Content of Standardized Tests

The popularity of standardized tests for use in program evaluation
makes knowledge of their content important. To that end, and to further
illustrate the possibility of treatment-by-item interactions on presumably
unidimensional tests, our taxonomy has been used to classify
fourth grade mathematics content on the four most widely used standardized
tests: the Stanford Achievement Test (SAT), the Iowa Test of Basic
Skills (Iowa), the Metropolitan Achievement Test (MAT), and the
California Test of Basic Skills (CTBS).*

The items in the mathematics subtests at the fourth grade level
for all four standardized test batteries were independently classified
by three of the authors. Assuming that agreement between two out of
three raters makes an item classifiable, 96% of all the items could
be classified. Inter-rater reliabilities are reported in Table 1 by
test battery, subtest, and dimension of the taxonomy. Only those
items on the Study Skills subtests pertaining to mathematics were
classified. The cell entries represent percent of possible pairs of
raters agreeing; for each item, all three raters agreeing counted as
three out of three possible pairs and two raters agreeing counted as
one out of three. Entries in the columns labeled C of Table 1 represent
agreement as to the exact cell in the matrix. As might be expected,
the computation subtests were described with the greatest accuracy,
90% or more agreement at the exact cell level. The concepts subtests
contained items most difficult to describe using the taxonomy, with
exact cell agreements near 60%. The four tests were nearly equal in
the extent to which they could be accurately described; the Iowa was
a slight exception, particularly since it did not contain a subtest
devoted to computation.

* Iowa Tests of Basic Skills (1971); Level 10; Tests M-1, M-2, and
appropriate items on W-2.
Metropolitan Achievement Tests (1970); Elementary Level; Tests 5, 6, & 7.
Stanford Achievement Tests (1973); Primary Level III (3rd grade),
Intermediate Level I (4th grade), and Intermediate Level II (5th grade);
Tests 4, 5, & 6.
California Tests of Basic Skills (1968); Level II; Tests 6 & 7.
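
The pairwise agreement index described above can be sketched as follows. The function names and the rater codes are hypothetical; the counting rule (three of three pairs when all raters agree, one of three when exactly two agree) follows the description in the text.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Percent of rater pairs that agree, pooled over all items.

    `ratings` is a list with one tuple per item; each tuple holds the
    code assigned by each rater (for example, a taxonomy cell label).
    """
    agree = 0
    total = 0
    for item in ratings:
        for a, b in combinations(item, 2):
            total += 1
            agree += (a == b)
    return 100.0 * agree / total


def is_classifiable(item):
    """An item is classifiable when at least two raters give the same code."""
    return len(set(item)) < len(item)


# Hypothetical codes from three raters for three items.
ratings = [
    ("A", "A", "A"),  # all agree: 3 of 3 pairs
    ("A", "A", "B"),  # two agree: 1 of 3 pairs
    ("A", "B", "C"),  # none agree: 0 of 3 pairs
]
print(round(pairwise_agreement(ratings), 1))   # 4 of 9 pairs agree
print([is_classifiable(item) for item in ratings])
```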

The percentages of items in each test battery at all levels of
every dimension are presented in Table 2. For these data, an item
was classified by reviewing the independent decisions of the three
raters and resolving disagreements to the raters' mutual satisfaction.
The reliabilities reported in Table 1 represent, therefore, a strong
lower bound to that for data in Table 2. In one sense, the data in
Table 2 may be misleading in that the percentages reported for the
marginals of the taxonomy could be in agreement and still there would
be no overlap in classification of items from the different tests at
the cell level. To the extent that differences occur on the marginals,
however, the tests do differ in content and at a rather low level of
detail.
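
The caution about marginals can be made concrete with a small sketch. The two item lists below are hypothetical; they are constructed so that the distributions on every factor match exactly while no taxonomy cell is shared.

```python
from collections import Counter

# Hypothetical items, each coded as (mode, material, operation).
test_a = [
    ("operation specified", "single digits", "add"),
    ("operation specified", "decimals", "multiply"),
]
test_b = [
    ("operation specified", "single digits", "multiply"),
    ("operation specified", "decimals", "add"),
]


def marginal(items, factor):
    """Count items at each level of one factor (0=mode, 1=material, 2=operation)."""
    return Counter(item[factor] for item in items)


same_marginals = all(
    marginal(test_a, f) == marginal(test_b, f) for f in range(3)
)
shared_cells = set(test_a) & set(test_b)
print(same_marginals, len(shared_cells))  # True 0
```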

For mode of presentation, three of the four tests appeared quite
similar, but the Iowa had a substantially larger proportion of items
where essential information was presented in the form of graphs, figures,
and tables. This difference was due, in part, to the absence of a
computation subtest on the Iowa but not entirely, since the raw number
of such items was considerably greater as well. With the exception
of the Iowa, roughly 20% of the items involved graphs, figures, or
tables, and a little less than a third of the items required the
respondent to figure out the necessary operation (for the most part,
story problems).

On the nature of materials, there were more similarities than
differences among the four test batteries. Still, some important
differences existed. For example, the subtotals for the three levels
involving whole numbers varied from 39 - 66% with the SAT having the
highest percentage. Other frequently represented levels were algebraic
sentences, at roughly 10%, and essential units of measurement, which
ranged from a low of 7% on the SAT to a high of 15% on the MAT. Percents,
alternative number systems, and geometric figures were not emphasized
on any of the tests. To provide a better understanding of
these differences it must be pointed out that for the SAT, a percent
is about .9 of an item. Further, an item is equivalent to approximately
.2 of a grade equivalent near the middle of the norm distribution
on the SAT math subtests.

On the operations factor the tests were quite similar in the
percentages of items involving subtract without borrowing (6% - 8%),
add or subtract fractions without a common denominator (0% - 2%),
divide with remainder (1%), and combinations (6% - 8%). For the
remaining levels there were modest to strong differences among the tests.
The MAT, for example, had 21% addition items, which was about eight
percentage points more than the other tests. The Iowa had at least
five percentage points fewer multiplication items than did the other
tests. Grouping was tested by the SAT but not at all by either the
MAT or the CTBS.

To provide some sense of how the tests varied in content across
grade levels, the third and fifth grade levels on the SAT were also
analyzed. The results are reported in Table 3 and are based on resolution
of any disagreements between two independent raters, both of
whom were also raters for the data in Table 2. The percentage
distributions of items across mode of presentation levels remained
nearly identical from the third grade level to the fifth grade level.
Under nature of material, the percentages for items classified as
single digits and place value decreased while the percentages
increased for items classified as fractions, decimals, and percents.
Surprisingly, the percent of items classified as algebraic sentences
held quite constant at approximately 10%.

The data in Tables 2 and 3 represent descriptions of mathematics
content across all subtests using only the marginals of the taxonomy.
The data in Figure 2 represent item distributions across the cells of
the taxonomy for the Concepts subtest of the SAT and the MAT. The X's
in the upper half of each cell represent items on the SAT, and the O's
in the bottom half of each cell represent items on the MAT. Across the
two subtests, items fell into 47 different cells. Of those 47 cells,
however, only 7, or 15%, were common to both tests. While the cell level
analysis was most dramatic, sizable differences were reflected in
comparisons on the marginals. For example, 12% of the MAT items were
classified Operation Not Specified, while there were no such items on
the SAT. Twenty-three percent of the MAT items involved essential
units of measure while only 6% of the SAT items were classified at
that level. The SAT had larger percentages of items classified as
grouping (6% compared to 0%), identify rule (19% compared to 7%), and
identify term (22% compared to 12%).
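
A cell-level overlap of this kind can be computed as the share of occupied cells that both tests occupy. In the sketch below the cell identifiers are hypothetical integers, arranged only so that the totals match the 47 occupied cells and 7 shared cells reported above; the split of cells between the two tests is an assumption, not a figure from the analysis.

```python
def cell_overlap(cells_a, cells_b):
    """Return (shared cells, occupied cells, percent shared) for two tests."""
    occupied = cells_a | cells_b
    shared = cells_a & cells_b
    return len(shared), len(occupied), 100.0 * len(shared) / len(occupied)


# Hypothetical cell identifiers for the two Concepts subtests.
sat_cells = set(range(0, 27))
mat_cells = set(range(20, 47))

shared, occupied, percent = cell_overlap(sat_cells, mat_cells)
print(shared, occupied, round(percent))  # 7 47 15
```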

It is clear from these content analyses that a total score on any
one of the subtests considered represents an aggregate across content
areas; these aggregates might well vary in their sensitivity to any
given mathematics intervention. It seems reasonable that a similar
content analysis of a program to be evaluated would yield hypotheses
about potential item-by-treatment interactions. Furthermore, there
is sufficient variance in content across tests so that some are more
likely relevant than others for assessing the effects of a given
intervention. Finally, a taxonomy is an efficient method for communicating
to those reading evaluation reports the information prerequisite to
deciding what constitutes practical significance.
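
To see why an aggregate score can hide an item-by-treatment interaction, consider a minimal sketch with invented numbers. The domain names are taken from the taxonomy, but the item counts and group means are hypothetical and do not come from any of the tests analyzed here.

```python
# Hypothetical group means (proportion correct) by item domain:
# domain -> (number of items, treatment group mean, control group mean)
domains = {
    "multiply": (10, 0.80, 0.60),
    "identify terms": (10, 0.50, 0.70),
}


def domain_effects(domains):
    """Treatment minus control difference within each item domain."""
    return {name: round(t - c, 2) for name, (_, t, c) in domains.items()}


def aggregate_effect(domains):
    """Item-weighted average of the domain effects (the total-score effect)."""
    total_items = sum(n for n, _, _ in domains.values())
    return sum(n * (t - c) for n, t, c in domains.values()) / total_items


print(domain_effects(domains))
print(round(aggregate_effect(domains), 2))
```

In this invented case the program helps on one domain and hurts on the other by the same amount, so the total-score effect is essentially zero even though the program clearly made a difference.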

Summary 

Defining practical significance in program evaluations is a difficult
measurement problem which can only be solved by an intimate familiarity
with the measures on which effects are estimated and their content
relationship with the goals of the program being evaluated. Past
attempts to provide general solutions to the size of effect problem have
relied on standardized indices which can be estimated and reported without
any knowledge of what was measured. For this reason, these efforts
are viewed here as steps in the wrong direction. Instead, what is
called for is a procedure whereby the content goals of the program, the
content implied by a test, and the interrelationship between the
two are made explicit. The procedure should investigate treatment-by-item
interactions and at the same time describe the measures used so
that persons other than the evaluator can reach their own decisions about
practical significance. The taxonomy of fourth grade mathematics
illustrated the possibility of obtaining better knowledge about variables
on which program effects are estimated. A detailed description of
the mathematics sections of the four major standardized tests, obtained
with the taxonomy, indicated rather substantial differences
in content tested. From the analyses, it was clear that the standardized
tests are not well suited to the task of estimating item domain by
treatment interactions, as most cells in the taxonomy were represented
by only one or two items.

ITEM DISTRIBUTIONS FOR EACH FACTOR ACROSS TESTS
FOURTH GRADE LEVEL

I. Mode of Presentation
- graphs, figures, tables, etc.
- operation(s) specified
- operation(s) not specified

II. Nature of Material
- single digits
- single and multiple digits
- multiple digits
- total: whole numbers
- single fraction
- multiple fractions
- decimals
- percents
- alternative number systems
- place value
- number sentences
- algebraic sentences
- essential units of measurement
- geometric figures
- other

III. Operations
- add
- subtract w/o borrowing
- subtract with borrowing
- add or subtract fractions w/o common denominator
- multiply
- divide w/o remainder
- divide with remainder
- combination
- grouping
- identify equivalents
- identify rule (order)
- identify terms

Entries are percents.

The "other" row does not represent a level of Nature of Material, but
rather the percent of items on each test that could not be fit into the
taxonomy.

ITEM DISTRIBUTIONS FOR EACH FACTOR ACROSS GRADES
STANFORD ACHIEVEMENT TEST

I. Mode of Presentation
- graphs, figures, tables, etc.
- operation(s) specified
- operation(s) not specified

II. Nature of Material
- single digits
- single and multiple digits
- multiple digits
- total: whole numbers
- single fraction
- multiple fractions
- decimals
- percents
- alternative number systems
- place value
- number sentences
- algebraic sentences
- essential units of measurement
- geometric figures
- other

III. Operations
- add
- subtract w/o borrowing
- subtract with borrowing
- add or subtract fractions w/o common denominator
- multiply
- divide w/o remainder
- divide with remainder
- combination
- grouping
- identify equivalents
- identify rule (order)
- identify terms